Workload-Aware Web Crawling and Server Workload Detection

Authors

  • Shaozhi Ye
  • Guohan Lu
  • Xing Li
Abstract

With the development of search engines, more and more web crawlers are used to gather web pages. The rising crawling traffic has raised the concern that crawlers may impact web sites. At the same time, a more efficient crawling strategy is required to maintain the coverage and freshness of the search engine index. In this paper, the crawlers of several major search engines are analyzed using a six-month access log of a busy web site. Surprisingly, we find that none of these crawlers pays attention to the workload of the web site, which may hurt both server performance and crawling efficiency. Based on this observation, a server workload-aware crawling strategy is proposed. By measuring the web service time with a hybrid back-to-back packet pair, server workload is detected on the client side, so that a crawler can adapt its crawling speed to the web server. The experimental results demonstrate the effectiveness of our workload detection approach. This paper concludes with a discussion of future work on server workload detection and its applications.
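The abstract does not state how a crawler should map measured service time to crawling speed, so the following is only a minimal sketch of one possible policy, not the authors' method. It assumes the service time has already been measured on the client side (the packet-pair measurement itself is not shown), smooths the samples with an exponentially weighted moving average, and scales the delay between requests by the ratio of the smoothed value to an idle-server baseline. The class name `AdaptiveDelay` and all parameter values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveDelay:
    """Hypothetical politeness controller; an illustration, not the paper's algorithm."""

    baseline: float                   # service time (s) of an idle server, assumed known
    min_delay: float = 1.0            # shortest allowed pause between requests (s)
    max_delay: float = 60.0           # longest allowed pause between requests (s)
    alpha: float = 0.3                # EWMA smoothing factor (assumed value)
    smoothed: Optional[float] = None  # running estimate of the service time

    def update(self, service_time: float) -> float:
        """Fold in one service-time sample and return the next crawl delay."""
        if self.smoothed is None:
            self.smoothed = service_time
        else:
            self.smoothed = (
                self.alpha * service_time + (1 - self.alpha) * self.smoothed
            )
        # A ratio above 1 means the server responds slower than when idle.
        load_ratio = self.smoothed / self.baseline
        delay = self.min_delay * load_ratio
        # Clamp so the crawler neither hammers the server nor stalls forever.
        return max(self.min_delay, min(self.max_delay, delay))


if __name__ == "__main__":
    controller = AdaptiveDelay(baseline=0.05)
    for sample in (0.05, 0.5, 0.5, 0.05):
        print(f"service time {sample:.2f}s -> delay {controller.update(sample):.2f}s")
```

The smoothing keeps a single slow response from stalling the crawl, while sustained slow responses steadily lengthen the delay up to the cap.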


Similar resources

Load Balancing on the Internet

Introduction; Workload Characteristics of Internet Services; Web Applications; Streaming Applications; Taxonomy of Load-Balancing Strategies; Load Balancing in the Server, the Network, and the Client Sides; State-Blind versus State-Aware Load Balancing; Load Balancing at Different Network Layers; Server-Side Load Balancing; DNS-Based Load Balancing; Dispatcher-Based Load Balancing; S...


Scalable Web Server Cluster Design with Workload-Aware Request Distribution Strategy WARD

Ludmila Cherkasova and Magnus Karlsson, Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94303. e-mail: fcherkasova,[email protected] Abstract. In this work, we consider a web cluster in which content-aware distribution is performed by each of the nodes in the cluster. Each server in the cluster may forward a request to another node based...


Disk-aware Request Distribution-based Web Server Power Management

This work is concerned with reducing power consumption by cluster-based web servers. We focused on server hard disks, a major source of server power consumption. We started with the modification of Logsim, a simulator for cluster-based web servers. The new simulator, NLogsim, behaves exactly like a cluster-based web server handling requests when they arrive. Based on NLogsim, we exposed the rela...


Disk-aware Request Distribution-based Web Server Disk Power Management

This report presents studies, implementation, and simulation we conducted for the course project of COS518, Fall 2003. The course project was concerned with reducing power consumption by cluster-based web servers. We focused on server hard disks, a major source of server power consumption. We started with the modification of Logsim, a simulator for cluster-based web servers. The new simulator, ...


A Model for Web Workload Generation Based on Content Classification

Web server performance is tightly bound to the workload the server has to support. Therefore, understanding the nature of the server workload is particularly important in capacity planning and overload control of Web servers. Web performance analysis can be done, a priori, with a synthetic generation of Web system workload. However, performance analysis results depend on the accuracy of this wo...




Publication date: 2004